Search CORE

159 research outputs found

Classification-Aware Hidden-Web Text Database Selection,

Author: Gravano Luis
Ipeirotis Panagiotis
Publication venue: 'American College of Medical Physics (ACMP)'
Publication date: 06/03/2006
Field of study

Many valuable text databases on the web have noncrawlable contents that are “hidden” behind search interfaces. Metasearchers are helpful tools for searching over multiple such “hidden-web” text databases at once through a unified query interface. An important step in the metasearching process is database selection, or determining which databases are the most relevant for a given user query. The state-of-the-art database selection techniques rely on statistical summaries of the database contents, generally including the database vocabulary and associated word frequencies. Unfortunately, hidden-web text databases typically do not export such summaries, so previous research has developed algorithms for constructing approximate content summaries from document samples extracted from the databases via querying.We present a novel “focused-probing” sampling algorithm that detects the topics covered in a database and adaptively extracts documents that are representative of the topic coverage of the database. Our algorithm is the first to construct content summaries that include the frequencies of the words in the database. Unfortunately, Zipf’s law practically guarantees that for any relatively large database, content summaries built from moderately sized document samples will fail to cover many low-frequency words; in turn, incomplete content summaries might negatively affect the database selection process, especially for short queries with infrequent words. To enhance the sparse document samples and improve the database selection decisions, we exploit the fact that topically similar databases tend to have similar vocabularies, so samples extracted from databases with a similar topical focus can complement each other. We have developed two database selection algorithms that exploit this observation. The first algorithm proceeds hierarchically and selects the best categories for a query, and then sends the query to the appropriate databases in the chosen categories. The second algorithm uses “shrinkage,” a statistical technique for improving parameter estimation in the face of sparse data, to enhance the database content summaries with category-specific words.We describe how to modify existing database selection algorithms to adaptively decide (at runtime) whether shrinkage is beneficial for a query. A thorough evaluation over a variety of databases, including 315 real web databases as well as TREC data, suggests that the proposed sampling methods generate high-quality content summaries and that the database selection algorithms produce significantly more relevant database selection decisions and overall search results than existing algorithms.NYU, Stern School of Business, IOMS Department, Center for Digital Economy Researc

CiteSeerX

New York University Faculty Digital Archive

Recommended from our members

Extracting Relations from Large Plain-Text Collections

Author: Agichtein Eugene
Gravano Luis
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/1999
Field of study

Text documents often contain valuable structured data that is hidden in regular English sentences. This data is best exploited if available as a relational table that we could use for answering precise queriesor for running data mining tasks. We explore a technique for extracting such tables from document collections that requires only a handful of training examples from users. These examples are used to generate extraction patterns,that in turn result in new tuples being extracted from the document collection. We build on this idea and present our Snowball system. Snowball introduces novel strategies for generating patterns and extracting tuples from plain-text documents. At each iteration of the extraction process, Snowball evaluates the quality of these patterns and tuples without human intervention,In this paper we also develop a scalable evaluation methodology and metrics for our task, and present a thorough experimental evaluation of Snowball and comparable techniques over a collection of more than 300,000 newspaper documents

Columbia University Academic Commons

Recommended from our members

Summarizing and Searching Hidden-Web Databases Hierarchically Using Focused Probes

Author: Gravano Luis
Ipeirotis Panagiotis G.
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2001
Field of study

Many valuable text databases on the web have non-crawlable contents that are "hidden" behind search interfaces. Metasearchers are helpful tools for searching over many such databases at once through a unified query interface. A critical task for a metasearcher to process a query efficiently and effectively is the selection of the most promising databases for the query, a task that typically relies on statistical summaries of the database contents. Unfortunately, web-accessible text databases do not generally export content summaries. In this paper, we present an algorithm to derive content summaries from "uncooperative" databases by using "focused query probes," which adaptively zoom in on and extract documents that are representative of the topic coverage of the databases. The content summaries that result from this algorithm are efficient to derive and more accurate than those from previously proposed probing techniques for content-summary extraction. We also present a novel database selection algorithm that exploits both the extracted content summaries and a hierarchical classification of the databases, automatically derived during probing, to produce accurate results even for imperfect content summaries. Finally, we evaluate our techniques thoroughly using a variety of databases, including 50 real web-accessible text databases

Columbia University Academic Commons

Computing Geographical Scopes of Web Resources

Author: Ding Junyan
Gravano Luis
Shivakumar Narayanan
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2000
Field of study

Many information resources on the web are relevant primarily to limited geographical communities. For instance, web sites containing information on restaurants, theaters, and apartment rentals are relevant primarily to web users in geographical proximity to these locations. In contrast, other information resources are relevant to a broader geographical community. For instance, an on-line newspaper may be relevant to users across the United States. Unfortunately, most current web search engines largely ignore the geographical scope of web resources. In this paper, we introduce techniques for automatically computing the geographical scope of web resources, based on the textual content of the resources, as well as on the geographical distribution of hyperlinks to them. We report an extensive experimental evaluation of our strategies using real web data. Finally, we describe a geographically-aware search engine that we have built using our techniques for determining the geographical scope of web resources

CiteSeerX

Columbia University Academic Commons

Combining Strategies for Extracting Relations from Text Collections

Author: Agichtein Eugene
Eskin Eleazar
Gravano Luis
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2000
Field of study

Text documents often contain valuable structured data that is hidden in regular English sentences. This data is best exploited if available as a relational table that we could use for answering precise queries or for running data mining tasks. Our Snowball system extracts these relations from document collections starting with only a handful of user-provided example tuples. Based on these tuples, Snowball generates patterns that are used, in turn, to find more tuples. In this paper we introduce a new pattern and tuple generation scheme for Snowball, with different strengths and weaknesses than those of our original system. We also show preliminary results on how we can combine the two versions of Snowball to extract tuples more accurately

CiteSeerX

Columbia University Academic Commons

Beyond Trending Topics: Real-World Event Identification on Twitter

Author: Becker Hila
Gravano Luis
Naaman Mor
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2011
Field of study

User-contributed messages on social media sites such as Twitter have emerged as powerful, real-time means of information sharing on the Web. These short messages tend to reflect a variety of events in real time, earlier than other social media sites such as Flickr or YouTube, making Twitter particularly well suited as a source of real-time event content. In this paper, we explore approaches for analyzing the stream of Twitter messages to distinguish between messages about real-world events and non-event messages. Our approach relies on a rich family of aggregate statistics of topically similar message clusters, including temporal, social, topical, and Twitter-centric features. Our large-scale experiments over millions of Twitter messages show the effectiveness of our approach for surfacing real-world event content on Twitter

CiteSeerX

Columbia University Academic Commons

Association for the Advancement of Artificial Intelligence: AAAI Publications

Recommended from our members

QProber: A System for Automatic Classification of Hidden-Web Resources

Author: Gravano Luis
Ipeirotis Panagiotis G.
Sahami Mehran
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2001
Field of study

The contents of many valuable web-accessible databases are only available through search interfaces and are hence invisible to traditional web "crawlers." Recently, commercial web sites have started to manually organize web-accessible databases into Yahoo!-like hierarchical classification schemes. Here, we introduce QProber, a modular system that automates this classification process by using a small number of query probes, generated by document classifiers. QProber can use a variety of types of classifiers to generate the probes. To classify a database, QProber does not retrieve or inspect any documents or pages from the database, but rather just exploits the number of matches that each query probe generates at the database in question. We have conducted an extensive experimental evaluation of QProber over collections of real documents, experimenting with different types of document classifiers and retrieval models. We have also tested our system with over one hundred web-accessible databases. Our experiments show that our system has low overhead and achieves high classification accuracy across a variety of databases

Columbia University Academic Commons

Towards a Query Optimizer for Text-Centric Tasks

Author: Agichtein Eugene
Gravano Luis
Ipeirotis Panagiotis
Jain Pranay
Publication venue: ACM Transactions on Database Systems
Publication date: 28/10/2006
Field of study

Text is ubiquitous and, not surprisingly, many important applications rely on textual data for a variety of tasks. As a notable example, information extraction applications derive structured relations from unstructured text; as another example, focused crawlers explore the Web to locate pages about specific topics. Execution plans for text-centric tasks follow two general paradigms for processing a text database: either we can scan, or “crawl,” the text database or, alternatively, we can exploit search engine indexes and retrieve the documents of interest via carefully crafted queries constructed in task-specific ways. The choice between crawl- and query-based execution plans can have a substantial impact on both execution time and output “completeness” (e.g., in terms of recall). Nevertheless, this choice is typically ad hoc and based on heuristics or plain intuition. In this article, we present fundamental building blocks to make the choice of execution plans for text-centric tasks in an informed, cost-based way. Towards this goal, we show how to analyze query- and crawl-based plans in terms of both execution time and output completeness. We adapt results from random-graph theory and statistics to develop a rigorous cost model for the execution plans. Our cost model reflects the fact that the performance of the plans depends on fundamental task-specific properties of the underlying text databases. We identify these properties and present efficient techniques for estimating the associated parameters of the cost model.We also present two optimization approaches for text-centric tasks that rely on the cost-model parameters and select efficient execution plans. Overall, our optimization approaches help build efficient execution plans for a task, resulting in significant efficiency and output completeness benefits. We complement our results with a large-scale experimental evaluation for three important text-centric tasks and over multiple real-life data sets.NYU, Stern School of Business, IOMS Department, Center for Digital Economy Researc

New York University Faculty Digital Archive